Bayesian clustering of block structured relational data
Author
Abstract
The discovery of latent information in large-scale databases has become a major problem in information technology and statistics in recent years. The investigation of relational data, e.g. the structure of graphs, is an important part of this problem. One notable task is clustering the nodes of a graph, which means discovering cohesive groups and modelling their cohesion. Bayesian networks have become the most popular tools for modelling data with unobserved (latent) variables, since they are able to represent the causal relations between the latent and the observable variables. There are several inference methods for these networks, but they provide only general frameworks. The particular cases depend on the chosen model and on the distribution of the random variables, and in many cases they are not tractable. This thesis reviews the probabilistic methods of relational data modelling and the Bayesian inference procedures that are able to cluster and discover the latent structures of (occasionally huge) graphs. Building on the previous methods, a new generative model is proposed for clustering graphs. The model parameters are estimated with a Markov chain Monte Carlo (MCMC) algorithm, implemented with two different types of samplers. The new method is compared with the previous algorithms. It is applied to small graphs in order to choose the best parameters and to measure the clustering efficiency. In these experiments, the new model turns out to be more efficient than the earlier methods. Finally, the new tool is applied to a real, large dataset.
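To make the idea of MCMC-based graph clustering concrete, the following is a minimal sketch of a Gibbs sampler for a plain stochastic block model. It is not the generative model or either of the samplers proposed in the thesis, which the abstract does not specify. The function name `gibbs_sbm`, the fixed number of clusters `k`, the uniform prior on labels, and the Beta(1, 1) prior on block edge probabilities are all assumptions chosen for illustration.

```python
import math
import random


def gibbs_sbm(adj, k, sweeps=200, seed=0):
    """Gibbs sampler for a plain stochastic block model (illustrative only).

    adj: symmetric 0/1 adjacency matrix as a list of lists, no self-loops.
    k: assumed, fixed number of clusters.
    Returns the cluster labels after the final sweep.
    """
    rng = random.Random(seed)
    n = len(adj)
    z = [rng.randrange(k) for _ in range(n)]  # random initial labels

    for _ in range(sweeps):
        # Step 1: sample block edge probabilities from their Beta posteriors
        # (assumed Beta(1, 1) prior on each pair of clusters).
        edges = [[0] * k for _ in range(k)]
        pairs = [[0] * k for _ in range(k)]
        for i in range(n):
            for j in range(i + 1, n):
                a, b = sorted((z[i], z[j]))
                edges[a][b] += adj[i][j]
                pairs[a][b] += 1
        eta = [[0.0] * k for _ in range(k)]
        for a in range(k):
            for b in range(a, k):
                p = rng.betavariate(1 + edges[a][b],
                                    1 + pairs[a][b] - edges[a][b])
                p = min(max(p, 1e-9), 1 - 1e-9)  # keep the logs finite
                eta[a][b] = eta[b][a] = p

        # Step 2: resample each node's label given eta and all other labels.
        for i in range(n):
            logp = []
            for c in range(k):
                s = 0.0
                for j in range(n):
                    if j != i:
                        p = eta[c][z[j]]
                        s += math.log(p) if adj[i][j] else math.log(1 - p)
                logp.append(s)
            m = max(logp)  # subtract the maximum before exponentiating
            weights = [math.exp(v - m) for v in logp]
            z[i] = rng.choices(range(k), weights=weights)[0]
    return z
```

On a graph made of two disconnected cliques, calling `gibbs_sbm(adj, 2)` would typically assign one label to each clique. The result is a single posterior sample, so the labels are arbitrary up to permutation and can vary with the seed.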
Similar resources
Heterogeneous Component Analysis
In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for f...
Mixed Membership Stochastic Block Models for Relational Data with Application to Protein-Protein Interactions
Modeling relational data is an important problem for modern data analysis and machine learning. In this paper we propose a Bayesian model that uses a hierarchy of probabilistic assumptions about the way objects interact with one another in order to learn latent groups, their typical interaction patterns, and the degree of membership of objects to groups. Our model explains the data using a smal...
An Approach to Inference in Probabilistic Relational Models using Block Sampling
We tackle the problem of approximate inference in Probabilistic Relational Models (PRMs) and propose the Lazy Aggregation Block Gibbs (LABG) algorithm. The LABG algorithm makes use of the inherent relational structure of the ground Bayesian network corresponding to a PRM. We evaluate our approach on artificial and real data, and show that it scales well with the size of the data set.
A Hybrid Grey based Two Steps Clustering and Firefly Algorithm for Portfolio Selection
Considering the concept of clustering, the main idea of the present study is that the stocks considered for selection and ranking will not necessarily all fall in one cluster. Taking this into account, the study aims to offer a new methodology for making decisions about forming a portfolio of stocks in the stock market. To this end, Multiple-Criteria Decisi...
Stochastic Block Models of Mixed Membership
Abstract. We consider the statistical analysis of a collection of unipartite graphs, i.e., multiple matrices of relations among objects of a single type. Such data arise, for example, in biological settings, collections of author-recipient email, and social networks. In such applications, typical analyses aim at: (i) clustering the objects of study or situating them in a low dimensional space, ...